# TDSI Dataset Archive Structure

**DOI:** 10.7910/DVN/MY4XSD
**Date:** 2025-06-22

## Archive Overview

This archive contains the complete dataset and reproducibility materials for the study:

**"Cartographies of Silence: Mapping and Quantifying Translation Deserts in Global Climate-Risk Zones"**

The archive is structured to ensure complete reproducibility of all analyses and findings presented in the study. All data, code, and documentation are provided in open formats.

## Directory Structure

```
TDSI_Dataset/
│
├── README.md                           # Overview and usage instructions
├── LICENSE.txt                         # CC-BY 4.0 license information
├── CITATION.cff                        # Citation information in CFF format
│
├── raw_data/                           # Original unprocessed datasets
│   ├── WorldClim_2.1/                  # Climate data
│   │   └── Database_1_WorldClim_2.1_Raw_Data.csv
│   ├── Copernicus_ERA5/                # Drought indices
│   │   └── Database_2_Copernicus_ERA5_Raw_Data.csv
│   ├── Ethnologue/                     # Language data
│   │   └── Database_3_Ethnologue_Raw_Data.csv
│   ├── Glottolog/                      # Language geographic data
│   │   └── Database_4_Glottolog_Raw_Data.csv
│   ├── Translation_Resources/          # Translation deployment records
│   │   ├── Database_5_TWB_Deployment_Records.csv
│   │   ├── Database_6_IFRC_Emergency_Records.csv
│   │   └── Database_7_UNHCR_Surge_Records.csv
│   └── Validation/                     # Validation cases
│       └── Database_8_Validation_Cases.csv
│
├── processed_data/                     # Processed and derived datasets
│   ├── tdsi_country_aggregate.csv      # Country-level TDSI aggregation
│   ├── country_iso3_mapping.csv        # Country to ISO3 code mapping
│   └── Top_150_Languages_Investment_Priority.csv  # Priority language list
│
├── qa_logs/                            # Quality assessment documentation
│   └── tdsi_data_quality_assessment.log  # Comprehensive QA log
│
├── imputation_scripts/                 # Code for handling missing data
│   ├── tdsi_data_imputation.py         # Main imputation script
│   └── imputation_report/              # Imputation quality reports
│       └── [visualization files]
│
├── scenario_outputs/                   # Monte Carlo results
│   ├── monte_carlo_results.md          # Summary of findings
│   └── data/                           # Raw outputs
│       └── [1000 files]
│
├── analysis_code/                      # Analysis and visualization code
│   ├── 01_data_preparation.py          # Data preprocessing
│   ├── 02_tdsi_calculation.py          # TDSI index calculation
│   ├── 03_hotspot_identification.py    # Hotspot analysis
│   ├── 04_language_prioritization.py   # Language priority ranking
│   ├── 05_monte_carlo.py               # Uncertainty analysis
│   └── 06_visualization.py             # Figure generation
│
└── figures/                            # Publication-ready figures
    ├── Figure_1_Priority_Distribution.png
    ├── Figure_2_TDSI_Distribution.png
    ├── Figure_3_Population_Coverage.png
    ├── Figure_4_Investment_Matrix.png
    └── Figure_5_Geographic_Distribution.png
```
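After downloading and unpacking the archive, it can be useful to confirm that the files listed in the tree are actually present. The following is a minimal sketch, not part of the archive itself: the helper name `find_missing` and the constant `EXPECTED_PATHS` are hypothetical, and the list covers only a subset of the paths shown above.

```python
from pathlib import Path

# A subset of the relative paths shown in the directory tree above.
EXPECTED_PATHS = [
    "README.md",
    "LICENSE.txt",
    "CITATION.cff",
    "raw_data/WorldClim_2.1/Database_1_WorldClim_2.1_Raw_Data.csv",
    "raw_data/Copernicus_ERA5/Database_2_Copernicus_ERA5_Raw_Data.csv",
    "raw_data/Ethnologue/Database_3_Ethnologue_Raw_Data.csv",
    "raw_data/Glottolog/Database_4_Glottolog_Raw_Data.csv",
    "raw_data/Translation_Resources/Database_5_TWB_Deployment_Records.csv",
    "raw_data/Translation_Resources/Database_6_IFRC_Emergency_Records.csv",
    "raw_data/Translation_Resources/Database_7_UNHCR_Surge_Records.csv",
    "raw_data/Validation/Database_8_Validation_Cases.csv",
    "processed_data/tdsi_country_aggregate.csv",
    "processed_data/country_iso3_mapping.csv",
    "processed_data/Top_150_Languages_Investment_Priority.csv",
]


def find_missing(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that are absent under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `find_missing("TDSI_Dataset")` from the directory that contains the unpacked archive returns an empty list when every listed file is in place.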

## Data Dictionary

### Raw Data Files

1. **Database_1_WorldClim_2.1_Raw_Data.csv**
   - Climate data from WorldClim 2.1
   - Variables: longitude, latitude, annual_mean_temp, annual_precipitation, temp_anomaly, precip_anomaly
   - Resolution: 0.5° grid cells
   - Temporal coverage: 1970-2000 baseline vs 2021-2040 projections

2. **Database_2_Copernicus_ERA5_Raw_Data.csv**
   - Drought indices from Copernicus ERA5
   - Variables: longitude, latitude, spi_12, spei_12, drought_frequency, drought_intensity
   - Temporal coverage: 1960-2023

3. **Database_3_Ethnologue_Raw_Data.csv**
   - Language data from Ethnologue 27
   - Variables: language_name, iso_639_3, speakers, language_family, primary_country, region

4. **Database_4_Glottolog_Raw_Data.csv**
   - Language geographic data from Glottolog 5.2
   - Variables: language_name, glottocode, longitude, latitude, polygon_wkt, area_km2

5. **Database_5_TWB_Deployment_Records.csv**
   - Translators without Borders deployment records
   - Variables: deployment_id, date, location, languages, duration, translators, resource_type
   - Temporal coverage: 2015-2024

6. **Database_6_IFRC_Emergency_Records.csv**
   - IFRC emergency response records
   - Variables: emergency_id, date, location, disaster_type, languages, translation_resources
   - Temporal coverage: 2015-2024

7. **Database_7_UNHCR_Surge_Records.csv**
   - UNHCR surge roster records
   - Variables: deployment_id, date, location, languages, duration, interpreters
   - Temporal coverage: 2015-2024

8. **Database_8_Validation_Cases.csv**
   - Validation cases for model evaluation
   - Variables: event_name, date, location, disaster_type, languages_needed, translation_delay, affected_population
   - 45 cases from 2018-2024
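The variable lists above can be checked against the actual CSV headers before running any analysis. The sketch below uses only the standard library. The function name `header_diff` is hypothetical, and it assumes that the headers match the documented variable names character for character and that the files are UTF-8 encoded; neither is stated in this document.

```python
import csv

# Column names copied from the data dictionary above (two files shown as examples).
EXPECTED_COLUMNS = {
    "Database_1_WorldClim_2.1_Raw_Data.csv": [
        "longitude", "latitude", "annual_mean_temp",
        "annual_precipitation", "temp_anomaly", "precip_anomaly",
    ],
    "Database_8_Validation_Cases.csv": [
        "event_name", "date", "location", "disaster_type",
        "languages_needed", "translation_delay", "affected_population",
    ],
}


def header_diff(csv_path, expected):
    """Compare a CSV header row with the documented variable names.

    Returns (missing, extra): documented names not found in the file,
    and header names not listed in the data dictionary.
    """
    # "utf-8-sig" also strips a byte-order mark if one is present.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    missing = [name for name in expected if name not in header]
    extra = [name for name in header if name not in expected]
    return missing, extra
```

Two empty lists mean the file header agrees with the data dictionary; anything else points to a naming difference worth resolving before loading the data.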

### Processed Data Files

1. **tdsi_country_aggregate.csv**
   - Country-level TDSI aggregation
   - Variables: country, iso3, belt, tdsi_country_mean, tdsi_country_median, language_count

2. **country_iso3_mapping.csv**
   - Mapping between country names and ISO3 codes
   - Variables: Primary_Country, ISO3

3. **Top_150_Languages_Investment_Priority.csv**
   - Priority language list for pre-disaster translation investment
   - Variables: Rank, Language_Name, ISO_639_3, Speakers, Primary_Country, Region, Climate_Risk_Score, Language_Density_Score, Translation_Availability, TDSI_Score, Investment_Priority, Recommended_Investment
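Because `country_iso3_mapping.csv` and `Top_150_Languages_Investment_Priority.csv` share the `Primary_Country` column, the mapping file can be used to attach ISO3 codes to the priority list. The following is a rough sketch with hypothetical helper names (`read_rows`, `attach_iso3`). It assumes country names are spelled identically in both files; unmapped countries get `None` so mismatches are easy to spot.

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of dictionaries keyed by column name."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def attach_iso3(priority_rows, mapping_rows):
    """Return copies of the priority rows with an added "ISO3" key.

    The value is None when Primary_Country has no entry in the mapping.
    """
    lookup = {row["Primary_Country"]: row["ISO3"] for row in mapping_rows}
    return [
        {**row, "ISO3": lookup.get(row["Primary_Country"])}
        for row in priority_rows
    ]
```

A typical call would pass `read_rows("processed_data/Top_150_Languages_Investment_Priority.csv")` and `read_rows("processed_data/country_iso3_mapping.csv")`, then inspect any rows where `ISO3` is `None`.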

## Usage Notes

1. All scripts are written in Python 3.9+ and require standard scientific Python libraries (pandas, numpy, scipy, matplotlib, seaborn).

2. The complete analysis pipeline can be reproduced by running the scripts in the `analysis_code` directory in numerical order.

3. Monte Carlo calculations are computationally intensive and may take several hours to complete on standard hardware.

4. All geographic data uses the WGS84 coordinate reference system.
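Running the scripts in numerical order, as described in note 2, can be automated with a short driver. This is a sketch rather than a script shipped with the archive: `ordered_scripts` and `run_pipeline` are hypothetical names, and the working directory the analysis scripts expect is not stated here, so the driver simply inherits the current one.

```python
import subprocess
import sys
from pathlib import Path


def ordered_scripts(code_dir):
    """Return scripts named like NN_name.py, sorted by file name.

    With zero-padded two-digit prefixes, name order equals numeric order.
    """
    return sorted(Path(code_dir).glob("[0-9][0-9]_*.py"), key=lambda p: p.name)


def run_pipeline(code_dir):
    """Run each script in order with the current Python interpreter.

    check=True raises CalledProcessError on the first non-zero exit code,
    so later steps never run on incomplete inputs.
    """
    for script in ordered_scripts(code_dir):
        print(f"Running {script.name}")
        subprocess.run([sys.executable, str(script)], check=True)
```

For example, `run_pipeline("TDSI_Dataset/analysis_code")` would execute `01_data_preparation.py` through `06_visualization.py` in sequence. Given the runtime noted for the Monte Carlo step, it may be preferable to run `05_monte_carlo.py` separately.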

## Citation

If you use this dataset in your research, please cite:

```
Author(s). (2025). Cartographies of Silence: Mapping and Quantifying Translation Deserts in Global Climate-Risk Zones. Target, XX(X), XXX-XXX. https://doi.org/10.XXXX/XXXXX
```

And the dataset:

```
Author(s). (2025). Dataset for: Cartographies of Silence: Mapping and Quantifying Translation Deserts in Global Climate-Risk Zones [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/MY4XSD
```

## Contact

For questions regarding this dataset, please contact the corresponding author at: [email protected]

## License

This dataset is licensed under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).
